1 Aim

A PanNET classifier was built by Marco Visani as part of his student project (043_PanNET_classifier_student_project). For that classifier, feature selection was performed unsupervised after training the random forest on all available probes.

Here, feature selection is based on prior knowledge: the same DMPs used for consensus clustering also serve as features for training the model. The expectation is that using the same probes will increase the precision of the classifier. In addition, using a fixed set of probes allows splitting the available data into a training and a test set.

2 Loading data

The 155 PanNET samples used to define the epigenetic groups are loaded as described in the 030_NETG1G2_EPIC script 220609_NETG1G2_EPIC_consensus_clustering_plus_metastases.

3 General parameters

3.1 Data set

Analogous to the classifier trained by Marco, all 155 samples will be used as a training set. Model performance will then be evaluated using cross validation. The samples of the NETG1G2 cohort will be used as an independent testing group.

It needs to be noted that the new testing data will be based on the EPIC technology, while the training data contains only a few EPIC samples. Therefore, technical differences may impact model performance. However, all samples to be added to the classifier in the future are expected to be based on EPIC.

In the classifier trained by Marco, training was performed on the ComBat corrected data. The same data was used for cross validation, which constitutes a potential data leak and leads to overestimated model performance. Due to the low number of EPIC samples it is not possible to reliably perform ComBat batch correction within the cross validation folds. Therefore, all training is performed on the normalized but not batch corrected data. This allows a more realistic evaluation of model performance.

In the initial analysis the only technical variable associated with the data was the array position. It would be possible to correct for this variable as part of cross validation.

3.2 Feature selection

In Capper et al. 2018 feature selection was performed based on the random forest importance. In this workflow feature selection is performed outside of cross validation. This means that there is data leakage between test and training data during cross validation, leading to overoptimistic model estimates.

The same approach was taken by Marco, potentially explaining the reduced precision when predicting the NETG1G2 data compared to cross validation.

As an alternative it is possible to select the same probes that were used for consensus clustering. This still presents a case of data leakage during cross validation, because probe selection is done once at the start of training. However, the probes were not selected directly on this data set but only on the 125 samples from Di Domenico 2020. This partially mitigates the issue of overconfidence.

To reduce redundancy in the data, highly correlated probes (Pearson’s rho > 0.8) can be removed. While correlated probes are not problematic for random forest classifiers, they pose serious problems for other machine learning models. In a random forest model, correlated features split the feature importance measure, potentially excluding relevant probes. This is mainly important when performing feature selection based on the intrinsic random forest importance measure.
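A correlation filter of this kind can be expressed as a recipes step. This is a minimal sketch on simulated data; the data frame `betas` and the outcome column `group` are hypothetical placeholders for the real beta value matrix and class labels.

```r
library(recipes)

# hypothetical toy data: 5 probes (cg1..cg5) plus a class label
set.seed(1)
betas <- data.frame(
  group = factor(rep(c("A", "B"), each = 10)),
  matrix(runif(20 * 5), nrow = 20,
         dimnames = list(NULL, paste0("cg", 1:5)))
)

rec <- recipe(group ~ ., data = betas) %>%
  # drop one probe from every pair with Pearson correlation > 0.8
  step_corr(all_numeric_predictors(), threshold = 0.8, method = "pearson")

filtered <- prep(rec) %>% bake(new_data = NULL)
```

Because the filter is part of the recipe, it is refit inside each resample and does not leak information across folds.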

3.3 Feature engineering

The methylation data is already normalized and no further feature engineering is required.

To combat the class imbalance, the minority classes are oversampled to match the largest class.
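Oversampling to the majority class size can be sketched with the themis extension of recipes (`over_ratio = 1` matches the largest class); the toy data below is a placeholder.

```r
library(recipes)
library(themis)

# hypothetical imbalanced toy data: class sizes 30 / 10 / 5
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(30, 10, 5))),
  x = rnorm(45)
)

rec <- recipe(group ~ ., data = toy) %>%
  # sample minority classes with replacement up to the majority size
  step_upsample(group, over_ratio = 1)

balanced <- prep(rec) %>% bake(new_data = NULL)
table(balanced$group)  # each class now has 30 samples
```

As a recipe step, the oversampling is applied only to the analysis portion of each resample, not to the assessment data.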

3.4 Cross validation

Cross validation is performed to evaluate model performance and tune model hyperparameters.

  • 155 samples
  • Outer: 4 fold cross validation (5 repetitions)
  • Inner: 5 bootstraps
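The scheme above can be declared with rsample; a minimal sketch on a placeholder data frame:

```r
library(rsample)

# stand-in for the 155 training samples
set.seed(1)
toy <- data.frame(id = 1:155, x = rnorm(155))

folds <- nested_cv(
  toy,
  outside = vfold_cv(v = 4, repeats = 5),  # outer: 4 folds, 5 repetitions
  inside  = bootstraps(times = 5)          # inner: 5 bootstraps per analysis set
)

nrow(folds)  # 20 outer splits, each carrying 5 inner bootstrap resamples
```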

3.5 Metrics

The model performance can be scored using different summary statistics.

  • Accuracy: Fraction of samples assigned the correct class

4 Random forest

Parameter tuning

For mtry there is a default value of \(\sqrt{n\_probes}\). The search space is arranged symmetrically around this value.
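A sketch of such a grid; the probe count is a hypothetical placeholder:

```r
n_probes <- 5000                       # hypothetical number of DMPs
mtry_default <- floor(sqrt(n_probes))  # default for classification forests

# symmetric search space around the default, plus two tree counts
tuning_grid <- expand.grid(
  mtry  = mtry_default + c(-50, -25, 0, 25, 50),
  trees = c(500, 1000)
)
```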

4.1 All DMPs - ranger

The ranger random forest implementation is run including all DMPs.
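In the tidymodels workflow the model specification might look like this minimal sketch, with the tuning parameters left open:

```r
library(parsnip)
library(tune)

rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
```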

4.1.1 ROC AUC

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

  • The tuning parameters have very limited impact on model performance

4.1.2 Accuracy

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

  • Tuning does not improve accuracy either

Based on the model tuning the mtry is set to 75 and n_trees to 500.

4.2 Decorrelated DMPs - ranger

The ranger random forest implementation is run on the DMPs after removing highly correlated (Pearson’s rho > 0.8) probes. This increases the reliability of the probe importance estimates, which would otherwise be split across highly correlated probes.

4.2.1 ROC AUC

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

  • Similar to the tuning including all probes, the hyperparameters have little impact on model performance
  • The overall model performance is slightly reduced

4.2.2 Accuracy

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

  • The drop in performance is also visible for accuracy

While improving the interpretability of the probe importance scores, removing highly correlated probes also removes signal from the data. This seems to negatively impact model performance. Therefore, the model including all DMPs is used over the filtered model.

4.3 All DMPs - randomForest

A second implementation of random forests in the randomForest package is used on all DMPs.
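With tidymodels, switching implementations only requires changing the engine; a sketch using the tuned parameter values:

```r
library(parsnip)

# identical specification, randomForest engine instead of ranger
rf_spec_rf <- rand_forest(mtry = 75, trees = 500) %>%
  set_engine("randomForest") %>%
  set_mode("classification")
```

The surrounding recipe, resampling and metric code stays unchanged, which is what makes the two implementations directly comparable.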

4.3.1 ROC AUC

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

  • Performance is very similar to the ranger model

4.3.2 Accuracy

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

  • Performance is very similar to the ranger model

The randomForest implementation is markedly slower than ranger. Therefore ranger will be used for further evaluations.

5 Model evaluation

After tuning hyperparameters, the model performance is evaluated using the outer cross validation folds. More detailed statistics are collected to identify sources of error in the model.

5.1 Hard class performance metrics

  • accuracy: sensitive to class imbalances
  • kappa (kap): normalized accuracy, less sensitive to class imbalances
  • mcc: measure of association, less sensitive to class imbalances
  • sensitivity, recall, true positive rate: probability of correctly labeling a true positive
  • specificity, true negative rate: probability of correctly labeling a true negative
  • j_index, informedness: sensitivity + specificity - 1
  • balanced_accuracy: mean of sensitivity and specificity
  • precision: probability of true positives in the result
  • f_meas: harmonic mean of precision and recall (sensitivity); by default precision and recall are weighted equally

Most of these metrics, with the exception of accuracy, mcc and kap, are only defined for binary classifications. To extend these metrics to multiclass cases, the results can be averaged without or with weighting by class size (macro and macro_weighted). It is also possible to binarize all classifications into correctly classified and wrongly classified (micro).
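In yardstick the averaging mode is chosen via the `estimator` argument; a sketch using the `hpc_cv` example predictions shipped with the package:

```r
library(yardstick)

data(hpc_cv)  # example multiclass predictions (columns obs and pred)

sens(hpc_cv, truth = obs, estimate = pred, estimator = "macro")
sens(hpc_cv, truth = obs, estimate = pred, estimator = "macro_weighted")
sens(hpc_cv, truth = obs, estimate = pred, estimator = "micro")
```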

Confidence intervals are calculated from the t-distribution with 4 degrees of freedom.
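As a worked sketch of this interval, with hypothetical per-repeat estimates:

```r
# 5 cross validation repetitions -> 4 degrees of freedom
estimates <- c(0.86, 0.84, 0.87, 0.85, 0.87)  # hypothetical per-repeat values

m  <- mean(estimates)
se <- sd(estimates) / sqrt(length(estimates))
ci <- m + c(-1, 1) * qt(0.975, df = 4) * se   # 95% CI from the t-distribution
```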

Default averaging
.metric .estimator mean sd se CI_lower CI_upper
accuracy multiclass 0.858 0.014 0.006 0.841 0.875
mcc multiclass 0.800 0.019 0.009 0.776 0.824
kap multiclass 0.798 0.020 0.009 0.772 0.823
sens macro 0.832 0.031 0.014 0.794 0.870
spec macro 0.959 0.005 0.002 0.953 0.964
j_index macro 0.791 0.034 0.015 0.749 0.833
bal_accuracy macro 0.896 0.017 0.008 0.874 0.917
precision macro 0.863 0.019 0.008 0.839 0.886
f_meas macro 0.842 0.025 0.011 0.810 0.873
  • The accuracy gives a higher estimate than the adjusted metrics mcc and kap
  • Specificity is higher than sensitivity, likely due to the class imbalances
  • The imbalance is captured by the j_index but not the balanced accuracy
  • Precision is higher than sensitivity
Weighted averaging
.metric .estimator mean sd se CI_lower CI_upper
accuracy multiclass 0.858 0.014 0.006 0.841 0.875
mcc multiclass 0.800 0.019 0.009 0.776 0.824
kap multiclass 0.798 0.020 0.009 0.772 0.823
sens macro_weighted 0.858 0.014 0.006 0.841 0.875
spec macro_weighted 0.935 0.009 0.004 0.924 0.947
j_index macro_weighted 0.794 0.023 0.010 0.766 0.821
bal_accuracy macro_weighted 0.897 0.011 0.005 0.883 0.911
precision macro_weighted 0.856 0.014 0.006 0.839 0.873
f_meas macro_weighted 0.853 0.016 0.007 0.833 0.873
  • When using weighted averaging the sensitivity and accuracy are the same
  • Specificity does decrease (likely because sensitivity is highest in the smallest groups)

5.2 Soft classification metrics

  • roc_auc: how well the score (e.g. the random forest class probability) separates the classes
  • pr_auc: area under the precision - recall curve
  • gain_capture: area under the maximal gain curve
  • mn_log_loss: modification of accuracy penalizing low confidence predictions

For multiclass scenarios roc_auc and pr_auc need to be summarized. For the roc_auc the Hand-Till procedure is insensitive to class imbalances.

.metric .estimator mean sd se CI_lower CI_upper
roc_auc hand_till 0.973 0.002 0.001 0.970 0.975
pr_auc macro 0.931 0.006 0.003 0.923 0.938
gain_capture macro 0.944 0.003 0.001 0.940 0.948
mn_log_loss multiclass 0.584 0.008 0.003 0.574 0.593
  • As seen before, the roc_auc is always very high
  • pr_auc is slightly lower, as is the gain capture metric
  • The log loss on the uncalibrated scores is lower compared to Marco’s classifier (despite similar accuracy)

For the roc_auc, pr_auc and gain_capture measures the individual curves can be shown.
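A sketch of how such per-class curves can be produced with yardstick, again on the `hpc_cv` example data shipped with the package:

```r
library(yardstick)
library(ggplot2)

data(hpc_cv)  # multiclass predictions with one probability column per class

# one curve per class, visualized with the built-in autoplot methods
autoplot(roc_curve(hpc_cv, truth = obs, VF, F, M, L))
autoplot(pr_curve(hpc_cv, truth = obs, VF, F, M, L))
autoplot(gain_curve(hpc_cv, truth = obs, VF, F, M, L))
```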

5.2.1 roc_auc

  • For Beta_like and Intermediate_WT the separation of scores is most clear
  • The Intermediate_ADM_WT and Alpha_like groups allow early misclassifications

5.2.2 pr_auc

  • As expected, recall is highest in the two smallest groups Beta_like and Intermediate_WT
  • In the Intermediate_WT group there are large differences between the repeats (with small sample sizes group imbalances may become exaggerated)

5.3 gain curve

  • The interpretation of the gain curve is quite similar to the roc curve

5.4 Joining groups

Much of the uncertainty of the model lies within the Intermediate_ADM and Intermediate_WT groups. Differentiating between these groups may be harder than differentiating between an Intermediate and a more differentiated (Alpha_ and Beta_like) phenotype.

To see how this affects model interpretation, all Intermediate levels are merged. Likewise, Alpha_ and Beta_like levels are merged.

Many of the metrics depend on the chosen test level. Therefore the results are shown for both directions.
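The merging of levels can be sketched with forcats; the level names follow the labels used in this report, the merged level names are illustrative:

```r
library(forcats)

x <- factor(c("Alpha_like", "Beta_like", "Intermediate_ADM",
              "Intermediate_ADM_WT", "Intermediate_WT"))

# collapse five classes into two groups
fct_collapse(
  x,
  Alpha_Beta_like = c("Alpha_like", "Beta_like"),
  Intermediate    = c("Intermediate_ADM", "Intermediate_ADM_WT",
                      "Intermediate_WT")
)
```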

Test level: Alpha_ + Beta_like
.metric .estimator mean sd se CI_lower CI_upper
accuracy binary 0.937 0.005 0.002 0.930 0.943
mcc binary 0.845 0.011 0.005 0.832 0.859
kap binary 0.844 0.012 0.005 0.829 0.859
sens binary 0.924 0.011 0.005 0.911 0.937
spec binary 0.942 0.010 0.005 0.929 0.954
j_index binary 0.865 0.007 0.003 0.857 0.874
bal_accuracy binary 0.933 0.003 0.002 0.928 0.937
precision binary 0.855 0.021 0.009 0.830 0.881
f_meas binary 0.888 0.008 0.004 0.878 0.898
  • All general measures of accuracy are increased over the five class case
  • There are fewer Alpha_ and Beta_like samples reducing the precision
Test level: Intermediate
.metric .estimator mean sd se CI_lower CI_upper
accuracy binary 0.937 0.005 0.002 0.930 0.943
mcc binary 0.845 0.011 0.005 0.832 0.859
kap binary 0.844 0.012 0.005 0.829 0.859
sens binary 0.942 0.010 0.005 0.929 0.954
spec binary 0.924 0.011 0.005 0.911 0.937
j_index binary 0.865 0.007 0.003 0.857 0.874
bal_accuracy binary 0.933 0.003 0.002 0.928 0.937
precision binary 0.971 0.004 0.002 0.966 0.975
f_meas binary 0.956 0.004 0.002 0.951 0.961
  • All general measures of accuracy are increased over the five class case
  • There are more Intermediate samples inflating the precision

5.5 Classification stability

Because cross validation was run in replicates, it is possible to calculate how frequently samples are not assigned the same class across all replicates.

  • The majority of instability is within the Intermediate groups
  • In particular the separation between Intermediate_ADM and Intermediate_ADM_WT is less stable
  • The second most frequent instability is between the Alpha_like and Intermediate classes
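The stability count can be derived directly from the collected predictions; a minimal sketch assuming a hypothetical `predictions` data frame with one row per sample and replicate:

```r
library(dplyr)

# toy stand-in: s1 is stable, s2 switches class between replicates
predictions <- data.frame(
  sample = rep(c("s1", "s2"), each = 3),
  pred   = c("A", "A", "A", "A", "B", "A")
)

predictions %>%
  group_by(sample) %>%
  summarise(n_classes = n_distinct(pred), .groups = "drop")
```

Samples with `n_classes > 1` are the unstable cases.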

5.6 Most relevant probes

From each cross validation fold the top 25 most relevant probes are extracted.

  • Some probes appear in many replicates
  • About half of the probes are unique to one fold and replicate

The low stability of the probes is likely a consequence of the way probes are weighted in the random forest models.
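Counting how often a probe reaches the top 25 ranks can be sketched as follows, assuming a hypothetical `importances` data frame with one importance value per probe and fold:

```r
library(dplyr)

# simulated stand-in for the per-fold importance scores
set.seed(1)
importances <- expand.grid(fold = 1:20, probe = paste0("cg", 1:100))
importances$importance <- runif(nrow(importances))

importances %>%
  group_by(fold) %>%
  slice_max(importance, n = 25) %>%  # top 25 probes per fold
  ungroup() %>%
  count(probe, sort = TRUE)          # folds in which each probe appears
```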

6 Model validation

All groups
.metric .estimator .estimate
accuracy multiclass 0.808
mcc multiclass 0.740
kap multiclass 0.727
sens macro 0.710
spec macro 0.948
j_index macro 0.657
bal_accuracy macro 0.829
precision macro 0.789
f_meas macro 0.700
Test level: Intermediate
.metric .estimator .estimate
accuracy binary 0.846
mcc binary 0.435
kap binary 0.416
sens binary 0.952
spec binary 0.400
j_index binary 0.352
bal_accuracy binary 0.676
precision binary 0.870
f_meas binary 0.909

This indicates that the class imbalance plays an important part in driving these measures. In the validation set only very few non-Intermediate samples are annotated.

6.1 Proximity of training and the validation sets

For reliable prediction the validation set needs to lie within the range of the training data. Data far outside the training range is likely to result in incorrect predictions.

In particular, the validation data is based on EPIC methylation chips, while the training data mostly contains 450K samples.

To test the similarity the validation data is projected into the PCA space defined by the training data. The PCA is based on the DMPs used in consensus clustering.
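The projection can be sketched with base R; `train_betas` and `valid_betas` are hypothetical beta value matrices restricted to the consensus clustering DMPs:

```r
# simulated stand-ins for the training and validation beta matrices
set.seed(1)
train_betas <- matrix(runif(100 * 10), nrow = 100)
valid_betas <- matrix(runif(20 * 10), nrow = 20)

# PCA fitted on the training data only
pca <- prcomp(train_betas, center = TRUE, scale. = FALSE)

# rotate the validation samples with the training loadings
valid_scores <- predict(pca, newdata = valid_betas)
```

Because the rotation is learned on the training data alone, overlap of the projected validation scores with the training scores indicates comparable data distributions.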

PCA of training data

  • The training data nicely shows the separation of the groups from consensus clustering

Projection of the validation set

  • The validation data set is well located within the range of the training data
  • The switch in technology does not seem to affect the data much

The result of the individual consensus clustering does not match well with the position of the samples in the PCA plot. This is problematic because the consensus clustering is used to define the ground truth. This is a large contributor to the reduced performance of the model on the validation data.
In particular, the prediction of Alpha_ and Beta_like samples seems off.

Comparison of predicted classes

  • The predicted classes do fit much better to the PCA than the classes from consensus clustering
  • More samples are predicted as Beta_like than expected from the location in the PCA

Another method for defining the ground truth is required.

7 Summary

An initial random forest classifier was developed by Marco Visani. The training followed the procedure used by Capper et al. to construct the brain tumor classifier.

One of the main open questions was the low overlap between probes used by the model and probes used for consensus clustering. Here, training was repeated using the DMPs from consensus clustering. Additionally, the model training procedure was switched to the tidymodels workflow that is more generalized and allows easier switching between different types of classifier.

Model performance is similar or slightly higher compared to the unsupervised model. Tuning of the model parameters had little impact on performance. This may be because none of the parameters was set to a very low value where performance is expected to degrade. Random forest models often work well with the default parameters, and this is confirmed here.
Also, two different implementations of random forests, ranger and randomForest, produced equivalent results.

When measuring the model performance, the imbalance in the classes should be taken into account. Many metrics such as the accuracy overestimate model performance in imbalanced cases. The likelihood based metrics roc_auc, pr_auc and gain_capture show how model performance differs across classes. In particular, the Intermediate_ADM and Intermediate_ADM_WT classes seem to be less well defined. The mn_log_loss is lower compared to Marco’s model, indicating a better fit.

Cross validation was performed in replicates and shows that the prediction is stable in most cases. Some Intermediate and few Alpha_like cases show instability. This may be connected to intrinsic ambiguity of the class identities. The probes used by the model are often unique to a particular replicate and therefore not very useful to extract biological information.
The weak link between probes and importance is due to the way probe importances are assigned by the random forest model. If probes are correlated, the importance scores will be distributed randomly across all correlated probes. This most likely is also the reason for the poor link between the probes in Marco’s model and the DMPs.

It is possible to de-correlate the data prior to model fitting. Doing this removes about 1000 probes and slightly degrades model performance. This means that removing these probes also does remove some of the information.

Applying the model to a validation set gives mixed results. The validation set is highly imbalanced and as such many metrics are highly skewed. The overall accuracy is not too bad, but the mcc and kap values are quite low. Combining all Intermediate samples allows high recall, but this most likely is due to their high number.

Importantly, looking at the similarity of the training and validation data it becomes apparent that the class assignments from the consensus clustering are a quite bad fit to the PCA embedding. Because the individual consensus clustering is used to define the ground truth, this poor fit degrades the apparent performance of the model. Otherwise, the fit between training and test data is quite good, with no indication that the change in technology skews the methylation data.

There are several options to improve the model. The most important one is the need to better define the ground truth in the validation set. However, it may also be beneficial to revisit the original consensus clustering. Potential replacements are graph based clustering algorithms such as Leiden or random walk.

Using the DMPs itself is a case of information leakage that may contribute to overly optimistic estimates in cross validation. A better approach would be to select probes from all of the data. This was done by Marco but outside of cross validation. It has been shown that it is important to also cross validate the feature selection steps. An improved model could do this either on the decorrelated probes or after running PCA to deal with the correlation in the methylation signal.
